Network-aware selective job checkpoint and migration to enhance co-allocation in multi-cluster systems

Author

  • William M. Jones
Abstract

Multi-site parallel job schedulers can improve average job turnaround time by making use of fragmented node resources available throughout the grid. By mapping jobs across potentially many clusters, jobs that would otherwise wait in the queue for local resources can begin execution much earlier, thereby improving system utilization and reducing average queue waiting time. Recent research in this area of scheduling leverages user-provided estimates of job communication characteristics to partition the job across system resources more effectively. In this paper, we address the impact of inaccuracies in these estimates on system performance and show that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are instances where these errors result in poor job scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application performance and turnaround time. Consequently, we explore the use of job checkpointing, termination, migration, and restart (CTMR) to selectively stop offending jobs to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions under which, and the extent to which, CTMR improves overall performance. We demonstrate that this technique is beneficial even when its overhead is costly. Copyright © 2009 John Wiley & Sons, Ltd.


Similar resources

Using BeoSim to Evaluate Bandwidth-aware Meta-schedulers for Co-allocating Jobs in a Mini-grid

Clusters of commodity processors have become fixtures in research laboratories around the world. Collections of several co-located clusters exist in many larger laboratories, universities, and research parks. This co-location of several resource collections naturally lends itself to the formation of a mini-grid. A mini-grid is distinguished from a traditional computational grid in that the mini...


Merging Similarity and Trust Based Social Networks to Enhance the Accuracy of Trust-Aware Recommender Systems

In recent years, collaborative filtering (CF) methods have become important and widely accepted techniques for recommender systems. One of these techniques is user-based: it produces useful recommendations based on the similarity between the ratings of like-minded users. However, these systems suffer from several inherent shortcomings such as data sparsity and cold start problems. With the dev...


Characterization of Bandwidth-aware Meta-schedulers for Co-allocating Jobs in a Mini-grid

In this paper, we present a bandwidth-centric job communication model that captures the interaction and impact of simultaneously co-allocated jobs in a grid. We compare our dynamic model with previous research that utilizes a fixed execution time penalty for co-allocated jobs. We explore the interaction of simultaneously co-allocated jobs and the contention they often create in the network infr...


Co-allocation with Communication Considerations in Multi-cluster Systems

Processor co-allocation can be of performance benefit. This is because breaking jobs into components reduces overall cluster fragmentation. However, the slower inter-cluster communication links increase job execution times. This leads to performance deterioration which can make co-allocation unviable. We use intra-cluster to inter-cluster communication speed ratio and job communication intensit...


A Hidden Node Aware Network Allocation Vector Management System for Multi-hop Wireless Ad hoc Networks

Many performance evaluations for IEEE 802.11 distributed coordination function (DCF) have been previously reported in the literature. Some of them have clearly indicated that 802.11 MAC protocol has poor performance in multi-hop wireless ad hoc networks due to exposed and hidden node problems. Although RTS/CTS transmission scheme mitigates these phenomena, it has not been successful in thoroughly omit...



Journal title:
  • Concurrency and Computation: Practice and Experience

Volume 21

Publication date: 2009